applications
objectives
methods
interpretation
February 22, ’23
aid in discovery of new populations of imperiled plants (Franklin 2010)
aid in creation of reserves under climate change models
aid in predicting joint species distributions, e.g. obligate mutualisms
aid in modelling spread of noxious species to novel ranges (but see Liu et al. 2020)
using known occurrences of a species, identify areas which have similar habitat and the potential to support populations
but, what about dispersal?
competition?
mutualisms?
Besseya (=Synthyris) alpina (A. Gray) Rydberg.
American Basin
B. alpina, Franklin #3948
define spatial domain and grain
software environments
dependent variables
independent variables
modelling approaches
model evaluation
predicting a model into space
domain: spatial extent of study
- administrative boundary
- ecological model
grain: scales in space and time
- resolution at which process occurs (space)
- current and past climate (time)
- projected climates
- (animals) seasonal patterns?
limitation: compute power
Domain
Occurrence Records
domain: continental (e.g. North America)
- maximum and minimum daily temperatures [monthly, 4km]
- precipitation [monthly, 4km]
- hydrologic drainage [millennial, 4km]
domain: regional (e.g. Southern Rockies)
- elevation [millennial, 1km]
- soil classes [millennial, 1km]
- solar radiation [millennial, 1km]
- precipitation form [monthly, 1km]
domain: fine (e.g. McDonald Woods)
- micro topography [decade, 1m]
- water relations [decade, 1m]
- shade [weekly, 1m]
- soils [decade, 1m]
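The compute-power limitation follows from simple arithmetic: halving the grain quadruples the number of cells in every predictor layer. A minimal Python sketch of that scaling; the extent areas below are rough placeholders chosen for illustration, not values from this talk.

```python
def n_cells(extent_km2: float, grain_m: float) -> int:
    """Approximate number of raster cells covering an extent at a given grain."""
    cell_area_km2 = (grain_m / 1000.0) ** 2
    return round(extent_km2 / cell_area_km2)

# Hypothetical extents (km^2) paired with the grains listed above
examples = {
    "continental, 4 km": (20_000_000, 4000),
    "regional, 1 km": (150_000, 1000),
    "fine, 1 m": (1, 1),
}

for label, (area_km2, grain_m) in examples.items():
    print(f"{label}: {n_cells(area_km2, grain_m):,} cells per layer")
```

Multiply by the number of predictor layers and time steps to get a feel for memory needs before choosing a grain.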
problems with all models
- garbage.(in) -> garbage.(out)
- influential outliers
with machine learning:
- models can fixate on these observations
solution:
- run many models, synthesize the results
“we are stronger together than we are alone” - Walter Payton
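One simple way to "synthesize the results" is to average the predictions of several models cell by cell and keep the spread as a measure of disagreement. A hedged Python sketch; the model names and suitability values are hypothetical, and an unweighted mean is only one of several possible ways to combine models.

```python
from statistics import mean, pstdev

# Hypothetical predicted suitability (0-1) for five map cells from three models
predictions = {
    "model_a": [0.90, 0.20, 0.65, 0.10, 0.55],
    "model_b": [0.80, 0.30, 0.70, 0.05, 0.20],
    "model_c": [0.85, 0.25, 0.60, 0.15, 0.90],
}

def synthesize(preds):
    """Return a (mean, standard deviation) pair per cell across all models."""
    per_cell = zip(*preds.values())
    return [(mean(cell), pstdev(cell)) for cell in per_cell]

ensemble = synthesize(predictions)
for i, (avg, spread) in enumerate(ensemble):
    print(f"cell {i}: mean suitability {avg:.2f}, spread {spread:.2f}")
```

Cells with a large spread are where a single model fixating on an outlier would have misled you most.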
explicitly check for variation
carefully encode categorical data
too much, may not be useful
too little, may not be useful
pilot knock-out studies: fit the model with one variable at a time, leaving the others out
warrants simplifying a variable?
t.test the difference in values between presence and absence points
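A sketch of the comparison behind that test, written in stdlib Python for illustration. It computes Welch's t statistic and degrees of freedom (the unequal-variance form, which is also the default of R's `t.test`); the elevation values are hypothetical, and a p-value would additionally need a t distribution from a statistics library.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom for two samples."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Hypothetical elevation (m) sampled at presence and absence points
presence = [3650, 3720, 3810, 3590, 3760, 3700]
absence = [2900, 3300, 3100, 3500, 2750, 3200]

t, df = welch_t(presence, absence)
print(f"t = {t:.2f}, df = {df:.1f}")
```

A large absolute t suggests the variable separates presences from absences and is worth keeping in the pilot.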
correlated!
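Correlated predictors can be screened before modelling by computing pairwise Pearson correlations and flagging strong pairs. A minimal Python sketch; the predictor names and values are hypothetical, and the 0.7 cutoff is an assumed rule of thumb rather than a value from this talk.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical predictor values extracted at six occurrence points
predictors = {
    "elevation_m": [2800, 3000, 3200, 3400, 3600, 3800],
    "mean_temp_c": [6.1, 5.0, 3.8, 2.9, 1.7, 0.4],
    "soil_ph": [6.5, 7.1, 6.2, 6.4, 7.0, 6.6],
}

def correlated_pairs(data, threshold=0.7):
    """List predictor pairs whose absolute correlation meets the threshold."""
    flagged = []
    for a, b in combinations(data, 2):
        r = pearson(data[a], data[b])
        if abs(r) >= threshold:
            flagged.append((a, b, round(r, 2)))
    return flagged

flagged = correlated_pairs(predictors)
print(flagged)
```

For each flagged pair, keep the variable with the clearer ecological interpretation and drop the other.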
much more common approach than individual linear models
many ‘weak learners’
species distributions are generally too complex for individual predictors, and building fully interactive terms would take a long time.
the typical approach since the late 1990s
do the work for you
none, get a few observations, the more the merrier.
no free lunch
try many types of models, select some that work for your application
common algorithms (families of models):
Practitioners are always wrong.
How do you want to be wrong?
the downsides of predicting suitable habitat where it isn’t?
the downsides of predicting non-suitable habitat where it really is?
What is the cost of ‘better’ predictions?
other projects? priorities? deadlines?
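The choice of "how to be wrong" is often made through the threshold that turns continuous suitability into suitable / not suitable. A hedged Python sketch with hypothetical scores and observations, counting both kinds of error at three assumed thresholds:

```python
# Hypothetical predicted suitability and observed presence (1) / absence (0)
suitability = [0.95, 0.80, 0.60, 0.40, 0.70, 0.30, 0.20, 0.10]
observed = [1, 1, 1, 1, 0, 0, 0, 0]

def errors_at(threshold):
    """Return (false positives, false negatives) when cells >= threshold are called suitable."""
    fp = sum(s >= threshold and o == 0 for s, o in zip(suitability, observed))
    fn = sum(s < threshold and o == 1 for s, o in zip(suitability, observed))
    return fp, fn

for threshold in (0.25, 0.50, 0.75):
    fp, fn = errors_at(threshold)
    print(f"threshold {threshold:.2f}: {fp} false positives, {fn} false negatives")
```

A low threshold sends field crews to habitat that is not there; a high threshold misses populations that are. Neither is free.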
\[ \text{Accuracy} = \frac{\text{correct classifications}}{\text{all classifications}} \]
\[ \text{Sensitivity} = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}} \] the probability of the method giving a positive result when the test subject is positive
\[ \text{Specificity} = \frac{\text{true negatives}}{\text{true negatives} + \text{false positives}} \] the probability of the method giving a negative result when the test subject is negative
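The three formulas can be computed directly from paired observations and predictions. A minimal Python sketch; the observed and predicted vectors are hypothetical and unrelated to the results reported in this talk.

```python
def classification_metrics(observed, predicted):
    """Accuracy, sensitivity and specificity for binary (1 = presence, 0 = absence) labels."""
    pairs = list(zip(observed, predicted))
    tp = sum(o == 1 and p == 1 for o, p in pairs)  # true positives
    tn = sum(o == 0 and p == 0 for o, p in pairs)  # true negatives
    fp = sum(o == 0 and p == 1 for o, p in pairs)  # false positives
    fn = sum(o == 1 and p == 0 for o, p in pairs)  # false negatives
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

# Hypothetical test set: 4 presences, 6 absences
observed = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]

metrics = classification_metrics(observed, predicted)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```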
| Accuracy | Sensitivity | Specificity |
|---|---|---|
| 0.997 | 0.969 | 0.945 |
| 0.994 | 0.969 | 0.945 |
| 0.995 | 0.945 | 0.910 |
suitability
keep a lab notebook; this is bench science
always start models small (avoid computer crashes)
use strong and discrete directory organization
scratch paper, whiteboards, flowcharts
dynamic programming; import/export data
track code on github
two hour discussion of the ‘sdm’ package by an author
large repository for high throughput modelling
large repository about spatial data in R
short activity using a sdm like process to teach spatial data
Ensemble learning combines many decision trees, each composed of many binary decisions, into a single model. Each independent variable (or feature) may become a node on a tree, i.e. a point where a binary decision moves an observation towards a predicted outcome. Each individual tree is a weak model which may suffer from high variance or bias, but which still produces better outcomes than would be expected by chance. When ensembled, these weak models form a strong model, one with a better balance of variance and bias, whose predictions are more strongly correlated with the expected values than those of the individual weak models.
Random Forest (RF): the training data are repeatedly bootstrap re-sampled and, in combination with random subsets of features, used to grow trees whose nodes attempt to optimally predict a known outcome. A large number of trees are then aggregated, by taking the most common prediction across trees, to generate the final classification. Each individual tree is grown independently of the others.
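The bootstrap, random-feature and majority-vote steps can be sketched in a few lines of Python. This is a toy illustration under strong assumptions: each "tree" is a single-split stump, the data are hypothetical (elevation in m, precipitation in mm), and all function names are invented here. It is not the implementation used by any R or Python package.

```python
import random
from collections import Counter

def fit_stump(rows, labels, features):
    """Pick the (feature, threshold) split with the fewest misclassifications."""
    best = None
    for f in features:
        for threshold in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= threshold]
            right = [y for r, y in zip(rows, labels) if r[f] > threshold]
            left_vote = Counter(left).most_common(1)[0][0]
            right_vote = Counter(right).most_common(1)[0][0] if right else left_vote
            errors = sum(y != left_vote for y in left) + sum(y != right_vote for y in right)
            if best is None or errors < best[0]:
                best = (errors, f, threshold, left_vote, right_vote)
    return best[1:]

def predict_stump(stump, row):
    f, threshold, left_vote, right_vote = stump
    return left_vote if row[f] <= threshold else right_vote

def fit_forest(rows, labels, n_trees=25, n_features=1, seed=1):
    """Grow each stump independently on a bootstrap sample and a random feature subset."""
    rng = random.Random(seed)
    all_features = list(range(len(rows[0])))
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # bootstrap re-sample
        feats = rng.sample(all_features, n_features)    # random subset of features
        forest.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx], feats))
    return forest

def predict_forest(forest, row):
    """Aggregate by the most common prediction across all stumps."""
    votes = Counter(predict_stump(s, row) for s in forest)
    return votes.most_common(1)[0][0]

# Hypothetical training data: (elevation_m, precip_mm), 1 = presence, 0 = absence
rows = [(3600, 900), (3700, 950), (3800, 1000), (3650, 920),
        (2800, 500), (2900, 550), (3000, 480), (2700, 600)]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

forest = fit_forest(rows, labels)
print(predict_forest(forest, (3900, 1050)), predict_forest(forest, (2600, 450)))
```

Because every stump sees a different re-sample and feature subset, no single odd observation can dominate the vote.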
Boosted Regression Tree (BRT, or gradient boosted tree): an initial tree is grown, and all later trees are grown sequentially after it. As each new tree is grown, the errors left by the previous trees are weighted more heavily, so that the model focuses on selecting independent variables which refine predictions. All response data and predictor variables remain available to all trees.
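The sequential idea can be sketched in Python by fitting each new stump to the residuals left by the trees before it. This is a toy under stated assumptions: squared-error loss on a 0/1 response (presence/absence boosting typically uses a Bernoulli loss), single-split stumps, hypothetical data, and invented function names. It is not the algorithm of any specific package.

```python
def fit_stump(rows, residuals):
    """Find the (feature, threshold) split that minimizes squared error of the residuals."""
    best = None
    for f in range(len(rows[0])):
        for threshold in sorted({r[f] for r in rows})[:-1]:
            left = [e for r, e in zip(rows, residuals) if r[f] <= threshold]
            right = [e for r, e in zip(rows, residuals) if r[f] > threshold]
            left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
            sse = (sum((e - left_mean) ** 2 for e in left)
                   + sum((e - right_mean) ** 2 for e in right))
            if best is None or sse < best[0]:
                best = (sse, f, threshold, left_mean, right_mean)
    return best[1:]

def stump_value(stump, row):
    f, threshold, left_mean, right_mean = stump
    return left_mean if row[f] <= threshold else right_mean

def boost(rows, y, n_trees=50, learning_rate=0.1):
    """Grow stumps sequentially, each one fitted to the current residuals."""
    base = sum(y) / len(y)
    preds = [base] * len(y)
    stumps = []
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, preds)]  # errors of the model so far
        stump = fit_stump(rows, residuals)                 # all rows and predictors available
        stumps.append(stump)
        preds = [p + learning_rate * stump_value(stump, r) for r, p in zip(rows, preds)]
    return base, learning_rate, stumps

def predict(model, row):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(stump_value(s, row) for s in stumps)

# Hypothetical training data: (elevation_m, precip_mm), 1 = presence, 0 = absence
rows = [(3600, 900), (3700, 950), (3800, 1000), (3650, 920),
        (2800, 500), (2900, 550), (3000, 480), (2700, 600)]
y = [1, 1, 1, 1, 0, 0, 0, 0]

model = boost(rows, y)
print(round(predict(model, (3900, 1050)), 3), round(predict(model, (2600, 450)), 3))
```

The small learning rate means each tree corrects only a fraction of the remaining error, which is why many trees are needed.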
Lee-Yaw, Julie A., Jenny L. McCune, Samuel Pironon, and Seema N. Sheth. 2022. “Species Distribution Models Rarely Predict the Biology of Real Populations.” Ecography 2022 (6): e05877.
Barbet-Massin, Morgane, Frederic Jiguet, Cecile Helene Albert, and Wilfried Thuiller. 2012. “Selecting Pseudo-Absences for Species Distribution Models: How, Where and How Many?” Methods in Ecology and Evolution 3 (2): 327–38.
Franklin, Janet. 2010. Mapping Species Distributions: Spatial Inference and Prediction. Cambridge University Press.
Hijmans, Robert J. 2022. terra: Spatial Data Analysis. https://CRAN.R-project.org/package=terra.
Kuhn, Max. 2022. caret: Classification and Regression Training. https://CRAN.R-project.org/package=caret.
Liu, Chunlong, Christian Wolter, Weiwei Xian, and Jonathan M Jeschke. 2020. “Species Distribution Models Have Limited Spatial Transferability for Invasive Species.” Ecology Letters 23 (11): 1682–92.
Naimi, Babak, and Miguel B. Araujo. 2016. “sdm: A Reproducible and Extensible R Platform for Species Distribution Modelling.” Ecography 39: 368–75. https://doi.org/10.1111/ecog.01881.
Pebesma, Edzer. 2018. “Simple Features for R: Standardized Support for Spatial Vector Data.” The R Journal 10 (1): 439–46. https://doi.org/10.32614/RJ-2018-009.
Senay, Senait D, Susan P Worner, and Takayoshi Ikeda. 2013. “Novel Three-Step Pseudo-Absence Selection Technique for Improved Species Distribution Modelling.” PloS One 8 (8): e71218.